Walking into the liquor store, the wine section can be a daunting experience. There are hundreds of varieties of wines and the prices range from a few dollars to a few thousands. Ten minutes of awkwardly reading the bottles and prices and you are out! You've spent 50 dollars and it all the wine you got tastes like crap!
How would you like to better that experience? What if there was a way to make purchasing wine easier and more efficient?
In our tutorial we hope to find the best wine for different budgets, and to find similar wines to ones you may already enjoy.That way no matter how large or how little your budget is, you know you're getting the best wine your money can buy.Not only saving us money but also giving our pallets something to enjoy.
This dataset consists of 130 thousand listings of wines. The prices range from 4 dollars to 3300 dollars per bottle. The wines are rated by connosieurs and span many countries world wide. More information can be found at https://www.kaggle.com/zynicide/wine-reviews?select=winemag-data_first150k.csv
Motivation: We believe that this is a good example of how data science can be applied to optimize everyday tasks.The tools shown in this tutorial can also be applied to topics and things beyond wine. Together the first and second section of our tutorial allow online retailers and subscription-based retail services to find what the user would like the most as well as suggest to the user what other things they might enjoy.
This tutorial is broken down into two parts. The first section handles finding the most similar wine to any given wine based on variety, points, and price.handles finding the best wine for a specific price range for the variety of wine of your choosing. The second section handles finding the best wine for a specific price range for the variety of wine of your choosing.
import pandas as pd
import matplotlib.pylab as plt
import matplotlib.axes as ax
import seaborn as sns
import numpy as np
wine_table = pd.read_csv('/winemag-data_first150k.csv')
#Provide the ability to show the entirety of the data when not looking at the head
pd.set_option("max_columns", None)
pd.set_option("max_rows", None)
wine_table.head()
Our aim is to use the Euclidean Distance formula to find a wine "closest" to our search wine. Here closest does not mean physically in our list, but rather on similarity of three factors: Price, Points and Variety.
You may be familiar with a simple distance formula for 2 dimensions, however we have here a third dimension: Variety. Variety is not a numerical data source. How does one analyse it as such?
We can do so by assigning a unique numeric value for every variety, thus allowing us to treat it as a number
# Finding unique varieties of wine in data
print('Variety:')
varieties = list(set(wine_table['variety'].to_list()))
wine_table['variety'].sort_values().unique()
# Displaying a sample of varieties
print(varieties[0:50])
# We will be using the index number from this set to act as the unique number identifier for each variety
print(varieties.index('Malbec'))
Here in the data frame below, one can see two Malbec wines with very similar points and prices, our goal is to be able to find the most similar wine to our search wine as possible.
First we need to create an (x,y,z) coordinate to represent the (variety,price,points)
# Given a specific wine, we want to be able to predict the wine closest or most similar to the specified wine based on variety, price and score
is_Malbec = wine_table['variety']=='Malbec'
malbec_table = wine_table[is_Malbec]
# Checking to see that filter worked
print('Malbec:')
print(malbec_table['variety'].sort_values().unique())
malbec_table.head()
# Creating points from wine rows to find euclidean distance
# Taking the example of an under 20 dollar bottle of Malbec (This corresponds to row 115 in the data set)
wine = wine_table.loc[115]
print(wine)
print("===================================")
print(wine['variety'])
print(varieties.index(wine['variety']))
print(int(wine['price']))
print(wine['points'])
print("===================================")
variety_id = varieties.index(wine['variety'])
# intializing point for search wine
search = np.array(((varieties.index(wine['variety'])), (int(wine['price'])), (wine['points'])))
print("search element:")
print(search)
Here we use the Euclidean distance between our search wine and any other wine to see which one is the most similar. As you can see, from the two wines below, a wine of the same variety and similar price and points will have a much smaller "distance" from our search wine than that of a wine of different variety, and less similar price and points
A more basic example using simple integers can be found at: https://www.w3resource.com/python-exercises/math/python-math-exercise-79.php
# Calculating Euclidean distance for two random wines
elem1 = wine_table.loc[86]
print("first random wine: ")
print(elem1)
element1 = np.array(((varieties.index(elem1['variety'])), (int(elem1['price'])), (elem1['points'])))
print("")
print("first random wine point: ")
print(element1)
print("====================================================================================")
elem2 = wine_table.loc[190]
print("second random wine: ")
print(elem2)
element2 = np.array(((varieties.index(elem2['variety'])), (int(elem2['price'])), (elem2['points'])))
print("")
print("second random wine point: ")
print(element2)
print("====================================================================================")
print("")
print("search element:")
print(search)
print("====================================================================================")
# printing Euclidean distance
print("")
print("Difference between first random wine and our wine")
dist1 = np.linalg.norm(search - element1)
print(dist1)
print("")
print("Difference between second random wine and our wine")
dist2 = np.linalg.norm(search - element2)
print(dist2)
Here we see the difference between wines, using this we can find the wine with the smallest difference from our search wine, not including itself. We can use the list of only malbec just to save ourselves a little time.
# looping through data-frame to find most similar wine and recording the wine with the smallest difference
print("Search Element:")
print("")
print(wine['variety'])
print(varieties.index(wine['variety']))
print(int(wine['price']))
print(wine['points'])
print("")
print(search)
print("")
print("====================================================================================")
print("Total number of Wines:")
print(len(wine_table.index))
print("")
# For the sake of saving calculation time and space I will be running the distance algorithm on a filtered set of wines
# This set is the set of wines of the same variety
print("Total number of Malbec wines:")
print(len(malbec_table.index))
# Here we could have used the entire table but for the sake of the tutorial and calculation time we will use the filered table
def findSimilar():
min_row = 0
min_dist = 100.0
for x in malbec_table.dropna().index:
elem = wine_table.loc[x]
point = np.array(((varieties.index(elem['variety'])), (int(elem['price'])), (elem['points'])))
dist = np.linalg.norm(search - point)
if dist < min_dist and dist > 0:
min_dist = dist
min_row = x
return min_row
##
similar = findSimilar()
print("Wine most similar to our input row is:")
most_sim = wine_table.loc[similar]
print(wine_table.loc[similar])
print("Search Wine:")
print("")
print(wine['variety'])
print(int(wine['price']))
print(wine['points'])
print("")
print(wine)
print("______________________________")
print("Most Similar Wine:")
print("")
print(most_sim['variety'])
print(int(most_sim['price']))
print(most_sim['points'])
print("")
print(most_sim)
print("")
Here we see our Input wine, a 14 dollar Malbec from Mendoza Province is most similar to another 14 dollar Malbec from Central Valley
Our goal when beginning to examine data for customers who are new to drinking wine is to find the overall best taste for price. By finding the best tasting wine at a reasonable price, this may help customers explore their interest in wine while not having to spend too much money. By gathering data within price ranges and then finding what wine scores the best for its price within those ranges we are able to find what types of wine are the best at certain price ranges. Our scatter plots shown below each have one graph for our set price ranges that we defined. These plots can then be broken into four quadrants and based off of the location of the plot, we can draw conclusions about the type of wine at that price range. Points are plotted based off of average score and price for the type of wine tested. Varieties in the top left quadrant represent a high score and low price which is the most efficient use of customer’s money. The top right quadrant shows high score for a high price and can be useful for people who want to explore more expensive wine. The bottom left quadrant groups types of wine that are bad tasting and low price for this price range. Finally, the bottom right quadrant is the least desired types of wine because they have a low score and high price which is the least efficient way to spend money when aiming for good tasting wine.
#Create custom cut intervals for easier use by a customer
cut_labels = ['0-20', '20-50', '50-100', '100-200','>500']
cut_bins = [0, 20, 50, 100, 200,500]
wine_table['custom_price_range'] = pd.cut(wine_table['price'].to_list(), bins=cut_bins, labels=cut_labels)
#Lists of intervals and varieties that can be looped without duplicates
intervals = wine_table['custom_price_range']
intervals = list(dict.fromkeys(intervals))
variety = wine_table['variety']
variety = list(dict.fromkeys(variety))
for curr in intervals:
#Query of a mini data table where all custom price ranges match the curr in the interval loop
qry = wine_table[wine_table['custom_price_range'] == curr]
#Loop through varieties within this interval
varieties = qry['variety']
varieties = list(dict.fromkeys(varieties))
for var in varieties:
#For each variety in this custom price range, gather the price and points
qry2 = qry[qry['variety'] == var]
pay = qry2['price'].to_list()
points = qry2['points'].to_list()
pay_sum = 0
points_sum = 0
for p in pay:
pay_sum += p
for p in points:
points_sum += p
x = []
#Take the average of the price and points for this variety at this price range
x.append(pay_sum/len(pay))
y = []
y.append(points_sum/len(points))
plt.title('Varieties\' Average Points per Price in the Price Range: ' + curr)
plt.xlabel('Average Price')
plt.ylabel('Average Points')
plt.scatter(x, y)
plt.figure(figsize=(10, 8))
plt.show()
Wine shoppers typically have a favorite or a select few types of wine they prefer to drink compared to others. Being able to find quality wine of their type for a good price is what is most important to shoppers. What we explore in our code here is looking at the price range and what the bulk of prices are listed at for certain types of wine depending on the quality the customer desires. By creating a violin plot representing this, we are able to visually see the price ranges and what wines are typically listed at for their quality. This is helpful in the future when we decide to implement a filter that allows the user to pinpoint their desired price and points range for their favorite type(s) of wine.
#violin
points = wine_table['points'].to_list()
wine_table['points_range'] = pd.cut(points, 10)
varieties = wine_table['variety']
varieties = list(dict.fromkeys(varieties))
#Create graphs for each variety and their target price for points
for var in varieties:
qry = wine_table[wine_table['variety'] == var]
plt.figure(figsize=(10, 8))
sns.violinplot(x=qry['points_range'], y=qry['price'], data=qry).set_title(var)
plt.xticks(rotation=45)
plt.show()
This code cells represents how our analysis of the wine tasting data can help people find the best quality wine for their price range. An example of 15 dollars falls under the category of 0 to 20 dollars depending on local tax legislature. After searching through the database, our program is able to return a row representing a Syrah wine in Washington, USA that received a score of 96 points for just 20 dollars spent. This is an example of filtering that we are able to do in order to provide wine drinkers with tools helping them to explore new bottles that they may have been seeking all along. Plenty more filters can be added in order to even further personalize and narrow the search results down, however implementation of this may have seemed redundant. Our code shows what we are capable of and how a shopper may use this tool.
#Find the best wine for $15
target = 15.0
range = ''
if 0.0 <= target < 20.0:
range = '0-20'
elif 20.0 <= target < 50.0:
range = '20-50'
elif 50.0 <= target < 100.0:
range = '50-100'
elif 100.0 <= target < 200.0:
range = '100-200'
elif 200.0 <= target < 500.0:
range = '200-500'
else:
range = '>500'
points = 0
qry = wine_table[wine_table['custom_price_range'] == range]
for index, row in qry.iterrows():
if row['points'] > points:
points = row['points']
best_value = row
print(best_value)
The last section of code added for our projects reflects changes made to the dataset in order to better optimize and fit our data for our needs. Columns added with ranges to provide a more personalized experience for the user was our goal in helping people find great wine that is of their type and budget.
#Show our updated data table with added columns
wine_table.head()
From this tutorial we can make wine shopping more efficient, allowing you to get the best tasting wine for your budget. We beleive we have achieved our goal in optimizing wine shopping as we cater to all wine consumers both novices and experts alike.
These tools are not limited to wine. Retailers can use these tools to enhance customer experiences as well increase their chances for sales. Customers also benefit as their own tastes are taken into consideration, and can be rest-assured that they are not overspending, it also narrows down the search, allowing customers to find what they want quicker.